Abstract

Q-learning is a reliable but inefficient off-policy temporal-difference
method, backing up reward only one step at a time. Replacing traces,
using a recency heuristic, are more efficient but less reliable. In this
work, we introduce model-free, off-policy temporal difference methods
that make better use of experience than Watkins' Q(λ). We introduce both
Optimistic Q(λ) and the temporal second difference trace (TSDT). TSDT is
particularly powerful in deterministic domains. TSDT uses neither
recency nor frequency heuristics, storing (s, a, r, s', δ) so that
off-policy updates can be performed after apparently suboptimal actions
have been taken. There are additional advantages when using state
abstraction, as in MAXQ. We demonstrate that TSDT does significantly
better than both Q-learning and Watkins' Q(λ) in a deterministic
cliff-walking domain. Results in a noisy cliff-walking domain are less
advantageous for TSDT, but demonstrate the efficacy of Optimistic Q(λ),
a replacing trace with some of the advantages of TSDT.

1 Introduction and Background 

The focus of this work is on improving the efficiency of online,
off-policy, temporal difference methods [Sutton and Precup, 1998;
Watkins, 1989], without compromising stability of convergence. One-step
methods such as Q-learning/Q(0) are slow but stable. Watkins' Q(λ) is
unstable for high decay rates and convergence is unproven in the
general case.

We introduce two algorithms here. Optimistic Q(λ) partially eliminates a
disadvantage of Watkins' Q(λ): that the trace must be cleared when
apparently suboptimal actions are taken. Temporal second difference
traces (TSDT) fully eliminate this disadvantage of Watkins' Q(λ) and are
more stable when using state abstraction, as in MAXQ [Dietterich, 1998;
Dietterich, 2000].

1.1 Basic Definitions 

A state s refers to a state in which an agent could find itself in the
course of solving a problem. The state space S refers to the set of all
states s. An action a refers to a possible course of action for an
agent. It can be useful to refer to a state-action pair, (s, a), meaning
to take action a from state s. The action space A(s) refers to the set
of all possible actions from state s. An absorbing state is one for
which A(s) is empty. A terminal state is one in which the problem has
been solved. A non-terminal state is one in which the process is
incomplete, and therefore it is not an absorbing state. S+ refers to the
subset of S containing only non-terminal states. s' denotes a successor
state following an action. A reward r is a numerical value indicating
the value of experiencing the triple (s, a, s'). A state transition
refers to a state-action pair, a reward, and a successor state, that is,
the quadruple (s, a, r, s').

A fully observable process is one in which an agent is always able to
observe the (s, a, r, s') quadruple. The state transitions for some
problems can be modeled accurately with a probabilistic transition
function, $\mathcal{P}^a_{ss'}$. The reward function, mapping
(s, a, s') to a reward, can sometimes be modeled accurately with a
probabilistic reward function, $\mathcal{R}^a_{ss'}$. A Markov decision
process (MDP) is a fully observable process in which
$\mathcal{P}^a_{ss'}$ and $\mathcal{R}^a_{ss'}$ together provide an
accurate model.

An online learning algorithm learns while gaining experience. An offline
learning algorithm waits until it is finished gaining experience to
learn. An on-policy learning algorithm learns about the policy it is
currently following. An off-policy learning algorithm learns about a
policy which may be different from the one it is currently following.

The discount rate γ ∈ [0, 1] refers to how quickly an agent ceases to
care about reward, and is often 1 for episodic/finite processes.
Discounted return refers to the total discounted reward following a
state-action pair. Expected return refers to the total discounted reward
expected to follow a state-action pair. δ refers to the difference
between expected discounted return and discounted return received.
Temporal difference (TD) methods are online learning algorithms which
update proportionally to δ. The learning rate α ∈ (0, 1] refers to how
much of the difference is applied in the update.

A state-action pair is starved if the state is never reached or the
action is never attempted from the state. An exploration policy is
non-starving if no state-action pair is starved as t → ∞. A Q-value
Q(s, a) represents the current estimate for the expected return for a
state-action pair. V(s) refers to the maximum Q(s, a) for all a.
Q-learning/Q(0) is an off-policy TD method which updates one Q-value per
step and is guaranteed to converge to the optimal policy given a
non-starving exploration policy. Watkins' Q(λ) is an off-policy
eligibility trace method which updates more than one Q-value per step
[Watkins, 1989]. The decay rate λ ∈ [0, 1] indicates how quickly entries
in the trace cease to be updated as they become less recent.

1.2 Temporal Difference Methods 
On-Policy Backups 

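$$Q(s,a) \Leftarrow Q(s,a) + \alpha\,[\,r + \gamma Q(s',a') - Q(s,a)\,] \tag{1}$$

Equation (1) is a standard one-step, on-policy TD backup. Q(s, a) is
updated to be closer to the sum of the immediate reward and the
discounted return expected one step into the future. This can be
expressed more readably, with the α above the arrow indicating that the
left-hand side moves toward the right-hand side at learning rate α:

$$Q(s,a) \overset{\alpha}{\Leftarrow} r + \gamma Q(s',a') \tag{2}$$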



This backup rule is used by Sarsa [Rummery and Niranjan, 1994], the
canonical on-policy TD method to use Q-values. It describes the behavior
of an agent navigating the state space using a somewhat greedy policy,
and using equation (2) to learn from its experience. Being an on-policy
algorithm, Sarsa learns about the actual policy being followed,
incorporating the effects of exploration. Sarsa can be guaranteed to
converge under certain conditions [Singh et al., 1998].
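As a minimal sketch of the on-policy backup in equation (1), assuming a
tabular Q stored in a dictionary keyed by (state, action), the update can
be written as below. The function name, the state and action labels, and
the default rates are illustrative choices, not part of the original
method description.

```python
from collections import defaultdict

def sarsa_backup(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    """Move Q(s, a) toward r + gamma * Q(s', a'), as in equation (1)."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

# Toy usage: the action actually chosen next ("left") has value 4.
Q = defaultdict(float)
Q[("s1", "left")] = 4.0
updated = sarsa_backup(Q, "s0", "right", r=-1.0, s_next="s1", a_next="left")
```

Because the target uses the Q-value of the action that will actually be
taken next, exploratory choices of a' feed directly into the estimate.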

Off-Policy Backups 

There are strict requirements for Sarsa to converge to an optimal
policy. Off-policy learning makes it easier to cope with more diverse
exploration strategies. So long as α is sufficiently low and decreased
appropriately for stochastic domains, the only requirement to guarantee
convergence is that the exploration policy must be non-starving.

To accomplish this, Q-learning [Watkins, 1989] backs up the best next
Q-value rather than the Q-value corresponding to the next selected
action:

$$Q(s,a) \overset{\alpha}{\Leftarrow} r + \gamma V(s') \tag{3}$$
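The off-policy backup of equation (3) differs from the Sarsa backup only
in its target. A hedged sketch, again assuming a dictionary-based Q
table and a hypothetical `actions` mapping from each state to its
available actions (empty for an absorbing state):

```python
from collections import defaultdict

def q_learning_backup(Q, actions, s, a, r, s_next, alpha=0.5, gamma=1.0):
    """Move Q(s, a) toward r + gamma * V(s'), where V(s') = max_b Q(s', b)."""
    # An absorbing successor has no actions, so V(s') falls back to 0 and
    # the backup reduces to moving Q(s, a) toward r alone.
    v_next = max((Q[(s_next, b)] for b in actions[s_next]), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * v_next - Q[(s, a)])
    return Q[(s, a)]

actions = {"s0": ["right"], "s1": ["left", "right"], "end": []}
Q = defaultdict(float)
Q[("s1", "left")] = 4.0
Q[("s1", "right")] = 10.0

# The best next value (10) is backed up even if "left" is taken next.
off_policy = q_learning_backup(Q, actions, "s0", "right", -1.0, "s1")
# A transition into the absorbing state backs up the reward only.
terminal = q_learning_backup(Q, actions, "s1", "left", 2.0, "end", alpha=1.0)
```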



Learning off-policy has the disadvantage that an agent may choose
actions that are riskier given the exploration strategy, because the
effect of exploration is completely removed. Another disadvantage is
that techniques for speeding up learning become more difficult. Some of
these difficulties will be discussed in section 1.3.

In exchange for these disadvantages, learning off-policy 
causes an agent's policy to more stably and directly approach 
the optimal policy, regardless of the exploration strategy. Fur- 
thermore, it enables an agent to learn about more than one 
policy at a time. This may not be a great advantage for flat 
reinforcement learning, but it can speed up hierarchical reinforcement
learning considerably [Kaelbling, 1993].

Terminal Backups 

Regardless of whether an agent is learning on-policy or 
off-policy, the expression is simpler still for a terminal 
backup: 

$$Q(s,a) \overset{\alpha}{\Leftarrow} r \tag{4}$$

This is automatic for problems for which all terminal states 
are absorbing states. 



1.3 Beyond One-Step Methods 
Eligibility Traces 

Eligibility traces, such as Watkins' Q(λ), are a model-free method for
using recent memory to speed reinforcement learning. If one stores a
trace of the state-action pairs taken over the course of a task, it is
possible to pass δs back more than one step at a time. This can result
in a significant increase in the speed of learning at a cost to
stability.

Sarsa(λ) [Rummery and Niranjan, 1994] is the standard on-policy
eligibility trace. An entry can persist in the trace for arbitrarily
many steps for γ > 0 and λ > 0, regardless of the rewards encountered.

Development of an off-policy eligibility trace is more difficult. When
an agent takes an apparently suboptimal step for the sake of
exploration, Q-values are updated on the basis of the Q-value of an
action other than that which is taken.

Watkins' Q(λ) [Watkins, 1989] is the standard off-policy eligibility
trace. Entries are cleared from the trace after each apparently
suboptimal action. Therefore, in the worst case, it is no more efficient
at performing backups than Q(0). Entries can persist much longer in
practice.
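The trace-clearing behavior can be sketched as follows. This is a
minimal illustration using a replacing trace, not a reference
implementation. The `ACTIONS` table is a hypothetical encoding of the
small three-state MDP of Section 3.3 (moves cost -1, and state C has two
terminating actions named "1" and "10" after their rewards), and all
three rates are set to 1 as in that section.

```python
from collections import defaultdict

ALPHA = GAMMA = LAMBDA = 1.0

# Hypothetical encoding: actions are named after their destination.
ACTIONS = {"A": ["B", "C"], "B": ["A", "C"],
           "C": ["A", "B", "1", "10"], "END": []}

def V(Q, s):
    """Largest Q-value available in s (0 for the absorbing END state)."""
    return max((Q[(s, a)] for a in ACTIONS[s]), default=0.0)

def watkins_episode(Q, steps):
    """Run one episode given as a list of (s, a, r, s_next) tuples."""
    e = {}  # eligibility of each (state, action) currently in the trace
    for i, (s, a, r, s_next) in enumerate(steps):
        delta = r + GAMMA * V(Q, s_next) - Q[(s, a)]
        for b in ACTIONS[s]:          # replacing trace for state s
            e.pop((s, b), None)
        e[(s, a)] = 1.0
        for key in e:                 # pass delta back along the trace
            Q[key] += ALPHA * delta * e[key]
        a_next = steps[i + 1][1] if i + 1 < len(steps) else None
        if a_next is None or Q[(s_next, a_next)] == V(Q, s_next):
            for key in e:             # greedy next action: decay and keep
                e[key] *= GAMMA * LAMBDA
        else:
            e.clear()                 # apparently suboptimal: cut the trace

Q = defaultdict(float)
watkins_episode(Q, [("A", "C", -1.0, "C"), ("C", "1", 1.0, "END")])
watkins_episode(Q, [("A", "C", -1.0, "C"), ("C", "10", 10.0, "END")])
```

In the second episode the action "10" looks suboptimal when it is
chosen, so the trace is cut and the reward of 10 never reaches Q(A, C).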

Peng's Q(λ) [Peng and Williams, 1996] trades off some of the off-policy
nature of Watkins' Q(λ) in order to allow an entry to persist in the
trace for arbitrarily many steps. Peng's Q(λ) is neither on-policy nor
off-policy.

Dyna-Q 

As an alternative approach to speeding learning, Dyna-Q [Dyna, 1991]
builds a model of the environment, learning both $\mathcal{P}^a_{ss'}$
and $\mathcal{R}^a_{ss'}$. Using this memory, it is able to learn from
past experience. It can simulate either sample or full updates for
arbitrary actions from visited states.

1.4 Paper Structure 

Section 2 introduces Optimistic Q(λ), an extension of Watkins' Q(λ).
Section 3 introduces the temporal second difference trace (TSDT), a
different kind of memory trace with some of the properties of
eligibility traces. Section 4 presents experimental results for both
algorithms in two cliff-walking domains. Section 5 provides a discussion
of theory and results.

2 Optimistic Q(λ)

Optimistic Q(λ) alleviates the need to completely clear traces as in
Watkins' Q(λ). However, it allows only positive net updates to take
place after apparently suboptimal actions have been taken.

Optimistic Q(λ), as depicted in Algorithm 1, is the first of two traces
developed in this paper. It is based on Watkins' Q(λ). The algorithm is
extended to track return experienced past an apparently suboptimal
action. If the sum of the return experienced so far and the expected
best return for the actions currently available would increase a
Q-value, then the Q-value is updated even if an apparently suboptimal
action has been taken since the entry was added to the trace. This is
sound because the update is performed only if the apparently suboptimal
choice of action ends up appearing optimal given information gained
later on.



Algorithm 1 Optimistic Q(λ). O(s, a) stores whether a given Q-value must
be updated optimistically only. E(s, a) stores the partial return
experienced since O(s, a) became True.

Ensure: Q initialized arbitrarily, e.g., Q(s, a) = 0,
        for ∀s ∈ S+, ∀a ∈ A(s)
 1: while an episode is to occur do
 2:   Initialize s {non-terminal, non-starving}
 3:   Initialize e(s, a) = 0 for ∀s ∈ S+, ∀a ∈ A(s)
 4:   Choose a from A(s) {non-starving}
 5:   repeat {for each step of the episode}
 6:     if Q(s, a) < V(s) then
 7:       for all s ∈ S, a ∈ A(s) do
 8:         O(s, a) ⇐ True {Instead of e(s, a) ⇐ 0}
 9:       end for
10:     end if
11:     Take action a, observe reward, r, and next state, s'
12:     for all b ∈ A(s) do {Replacing trace}
13:       if b = a then
14:         e(s, b) ⇐ 1
15:         O(s, b) ⇐ False
16:       else
17:         e(s, b) ⇐ 0
18:       end if
19:     end for
20:     Choose a' from A(s') {non-starving}
21:     δ_on ⇐ γQ(s', a') - Q(s, a)
22:     δ_off ⇐ γV(s') - Q(s, a)
23:     for all s ∈ S, a ∈ A(s) do
24:       if O(s, a) = False then
25:         E(s, a) ⇐ 0
26:       end if
27:       E(s, a) ⇐ E(s, a) + e(s, a) r
28:       δ ⇐ E(s, a) + e(s, a) δ_off
29:       if O(s, a) = False or δ > 0 then
30:         Q(s, a) ⇐ Q(s, a) + αδ
31:         E(s, a) ⇐ e(s, a)(δ_on - δ_off)
32:         {Optionally O(s, a) ⇐ False}
33:       end if
34:       e(s, a) ⇐ γλ e(s, a)
35:     end for
36:     s ⇐ s' and a ⇐ a'
37:   until s is terminal
38: end while

Let us step through the key part of the algorithm. Line 27 accumulates
the return experienced since the Q-value was last updated. Line 28 adds
in the off-policy (best) value for the current state. Line 29 allows the
update to take place only if no apparently suboptimal actions have been
taken since the Q-value was added to the trace, or if the update is
positive enough to be better than the last update, including the
off-policy update. Line 30 does the straightforward step of updating the
Q-value. Line 31, however, stores the negative off-policy part of the
update, causing the math in lines 29 and 30 to work out in the case that
updates must be optimistic. In the case that updates need not be
optimistic, line 25 later resets the value.
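The steps just described can be sketched in Python as follows. This is
one reading of Algorithm 1 under stated assumptions, not a definitive
implementation: the inner loop visits only entries that currently have
non-zero eligibility, the `ACTIONS` table is a hypothetical encoding of
the small MDP used in Section 3.3 (moves cost -1, state C has
terminating actions "1" and "10"), and all rates are set to 1.

```python
from collections import defaultdict

ALPHA = GAMMA = LAMBDA = 1.0
ACTIONS = {"A": ["B", "C"], "B": ["A", "C"],
           "C": ["A", "B", "1", "10"], "END": []}

def V(Q, s):
    """Largest Q-value available in s (0 for the absorbing END state)."""
    return max((Q[(s, a)] for a in ACTIONS[s]), default=0.0)

def optimistic_episode(Q, steps):
    """Run one episode given as a list of (s, a, r, s_next) tuples."""
    e, O, E = {}, {}, {}  # eligibility, optimistic-only flag, partial return
    for i, (s, a, r, s_next) in enumerate(steps):
        if Q[(s, a)] < V(Q, s):       # apparently suboptimal action:
            for key in e:             # flag entries instead of clearing them
                O[key] = True
        for b in ACTIONS[s]:          # replacing trace for state s
            e.pop((s, b), None)
        e[(s, a)], O[(s, a)] = 1.0, False
        E.setdefault((s, a), 0.0)
        a_next = steps[i + 1][1] if i + 1 < len(steps) else None
        q_next = Q[(s_next, a_next)] if a_next is not None else 0.0
        d_on = GAMMA * q_next - Q[(s, a)]
        d_off = GAMMA * V(Q, s_next) - Q[(s, a)]
        for key in list(e):
            if not O[key]:
                E[key] = 0.0                          # line 25
            E[key] += e[key] * r                      # line 27
            delta = E[key] + e[key] * d_off           # line 28
            if not O[key] or delta > 0:               # line 29
                Q[key] += ALPHA * delta               # line 30
                E[key] = e[key] * (d_on - d_off)      # line 31
            e[key] *= GAMMA * LAMBDA

Q = defaultdict(float)
optimistic_episode(Q, [("A", "C", -1.0, "C"), ("C", "1", 1.0, "END")])
optimistic_episode(Q, [("A", "C", -1.0, "C"), ("C", "10", 10.0, "END")])
```

On the episode pair {A,C,1} followed by {A,C,10}, the entry for (A, C)
is flagged rather than dropped when "10" is chosen, and the accumulated
return of -1 + 10 is positive, so the update goes through.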



Algorithm 2 Temporal Second Difference Trace (TSDT). Note that δ² is the
second difference; δ² ≠ δ · δ.

Ensure: Q initialized arbitrarily, e.g., Q(s, a) = 0,
        for ∀s ∈ S+, ∀a ∈ A(s)
 1: while an episode is to occur do
 2:   Initialize s {non-terminal, non-starving}
 3:   Initialize t(s, a) = 0 for ∀s ∈ S+, ∀a ∈ A(s)
 4:   repeat {for each step of the episode}
 5:     Choose a from A(s) {non-starving}
 6:     Take action a, observe reward, r, and next state, s'
 7:     for all b ∈ A(s) do {Replacing trace}
 8:       if b = a then
 9:         t(s, b) ⇐ current time step
10:         r(s, b) ⇐ r
11:         s'(s, b) ⇐ s'
12:         δ(s, b) ⇐ 0
13:       else
14:         t(s, b) ⇐ 0
15:       end if
16:     end for
17:     for all s ∈ S, a ∈ A(s), t(s, a) ≠ 0 in reverse t order do
18:       δ ⇐ r(s, a) + γV(s'(s, a)) - Q(s, a)
19:       δ² ⇐ δ - δ(s, a)
20:       Q(s, a) ⇐ Q(s, a) + αδ²
21:       δ(s, a) ⇐ r(s, a) + γV(s'(s, a)) - Q(s, a)
22:     end for
23:     s ⇐ s'
24:   until s is terminal
25: end while

3 Temporal Second Difference Trace 

Eligibility traces and Dyna-Q are well known mechanisms for speeding up
reinforcement learning. Unfortunately, attempts to apply eligibility
traces to off-policy learning have been limited in their success.
Eligibility traces have been cut short [Watkins, 1989], given up on
being entirely off-policy [Peng and Williams, 1996], and become very
complicated [Precup et al., 2000]. Dyna-Q is both simple and powerful
but requires the agent to learn a model ($\mathcal{P}^a_{ss'}$ and
$\mathcal{R}^a_{ss'}$). Here we introduce an algorithm with none of
these limitations.

3.1 Not (Quite) an Eligibility Trace 

The temporal second difference trace (TSDT), described in Algorithm 2,
is our version of a memory trace. It is interesting in that it isn't an
eligibility trace in the usual sense. Rather than keeping track of the
eligibility or strength of entries in the trace, the temporal second
difference trace simply keeps track of the updates being performed.
Using this information, it is able to tweak the updates as information
becomes available. Intuitively, it is as though earlier updates, based
on less complete information, are redone using more complete
information. It does not need to track an eligibility (or strength)
value at all.

More specifically, line 18 calculates the difference between the current
Q-value and what it is tending towards. Line 19 calculates the (second)
difference between this difference and the previous difference. Line 20
updates the Q-value proportionally to the second difference. Line 21
then stores the new difference. δ² is non-zero if either Q(s, a) or
V(s') has shifted, though the former can only happen when duplicate
Q-values appear in the trace.
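The four steps above can be sketched in Python as follows. This is a
minimal illustration under stated assumptions, not a definitive
implementation: the trace is a dictionary keyed by (state, action), a
step counter supplies the ordering parameter t, the `ACTIONS` table is a
hypothetical encoding of the small MDP used in Section 3.3 (moves cost
-1, state C has terminating actions "1" and "10"), and α = γ = 1.

```python
from collections import defaultdict

ALPHA = GAMMA = 1.0
ACTIONS = {"A": ["B", "C"], "B": ["A", "C"],
           "C": ["A", "B", "1", "10"], "END": []}

def V(Q, s):
    """Largest Q-value available in s (0 for the absorbing END state)."""
    return max((Q[(s, a)] for a in ACTIONS[s]), default=0.0)

def tsdt_step(Q, trace, clock, s, a, r, s_next):
    """Record one transition, then sweep the trace newest-first."""
    for b in ACTIONS[s]:              # replacing trace for state s
        trace.pop((s, b), None)
    trace[(s, a)] = {"t": clock, "r": r, "s_next": s_next, "delta": 0.0}
    newest_first = sorted(trace.items(), key=lambda kv: -kv[1]["t"])
    for key, entry in newest_first:
        # line 18: difference between Q and what it is tending towards
        delta = entry["r"] + GAMMA * V(Q, entry["s_next"]) - Q[key]
        # line 19: second difference, relative to the stored difference
        second = delta - entry["delta"]
        # line 20: update proportionally to the second difference
        Q[key] += ALPHA * second
        # line 21: store the new difference
        entry["delta"] = entry["r"] + GAMMA * V(Q, entry["s_next"]) - Q[key]

def tsdt_episode(Q, steps):
    """Run one episode given as a list of (s, a, r, s_next) tuples."""
    trace = {}                        # the trace is rebuilt every episode
    for clock, (s, a, r, s_next) in enumerate(steps, start=1):
        tsdt_step(Q, trace, clock, s, a, r, s_next)

Q = defaultdict(float)
tsdt_episode(Q, [("A", "C", -1.0, "C"), ("C", "1", 1.0, "END")])
tsdt_episode(Q, [("A", "C", -1.0, "C"), ("C", "10", 10.0, "END")])
```

No entry is dropped when "10" is chosen. Once V(C) rises to 10, the
stored difference for (A, C) no longer matches the recomputed one, and
the gap of 9 is applied to Q(A, C).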

3.2 Efficient Propagation of Information 

A parameter, t, is used to facilitate updating the trace in reverse
chronological order. As in Watkins' Q(λ) and Dyna-Q this is not strictly
necessary, but it guarantees that information will propagate backwards
through the trace as efficiently as possible. At each step, the trace
simply recalculates the difference, δ, and then modifies the Q-value by
the second difference, δ². Updates are as direct as those performed by
Dyna-Q, but no model is necessary. Regardless, a model could be used to
eliminate the learning rate parameter.

For the sake of computational efficiency, one can limit the length of
the trace or speed the implementation using lazy updates and backward
replay as with Watkins' Q(λ).

3.3 Some Examples 

Figure 1 depicts an MDP where state C is the focus. Let us examine the
results of learning with Watkins' Q(λ), Optimistic Q(λ), and the
Temporal Second Difference Trace (TSDT) across several episodes. We use
the syntax {s_0, s_1, ..., terminal reward} for an episode. We use the
rates α = 1, γ = 1, and λ = 1 where applicable.

Advantage of Longer Traces 

Having a longer trace gives Optimistic Q(λ) and TSDT the ability to
learn from rewards further in the future than can Watkins' Q(λ) when
learning off-policy.

Let us examine what happens when episode {A,C,1} is followed by episode
{A,C,10}. AC results in δ = -1, causing Q(A,C) = -1. C-1 results in
δ = 1, causing Q(C,1) = 1 and Q(A,C) = 0. The trace is cleared between
episodes. AC results in δ = 0. Finally, C-10 results in different
outcomes for the algorithms. Watkins' Q(λ) clears the trace, Optimistic
Q(λ) sets all entries to update optimistically only, and TSDT simply
keeps all entries in the trace. Given δ = 10, Q(C,10) = 10 for all
algorithms. Having cleared its trace, Watkins' Q(λ) is unable to update
Q(A,C) to 9, but Optimistic Q(λ) is able to perform the update, given
that 10 - 1 > 0. TSDT also performs the update given that V(C) has
increased.



Advantage of the Second Difference 

That TSDT does updates based on a one-step backup rule gives it an
advantage over Watkins' Q(λ) and Optimistic Q(λ), which rely instead on
a recency heuristic [Singh et al., 1996].




Figure 1: A deterministic MDP in which an agent begins in
state A, moves freely between states A, B, and C, and finally 
terminates in state C. 



Let us examine what happens when episode {A,C,1} is followed by episode
{A,C,B,C,10}. Let us jump ahead to CB. Watkins' Q(λ) clears the trace,
Optimistic Q(λ) sets all entries to update optimistically only, and TSDT
simply keeps all entries in the trace. Given δ = -1, Q(C,B) = -1 for all
three algorithms. BC results in δ = -1. Q(B,C) = -1 for all three
algorithms. Q(C,B) = -2 for both eligibility traces, but remains
unchanged for TSDT given that Q(B,A) is still 0. Finally, C-10 results
in different outcomes for all three algorithms. Once again Watkins' Q(λ)
clears the trace, Optimistic Q(λ) sets all entries to update
optimistically only, and TSDT simply keeps all entries in the trace.
Given δ = 10, Q(C,10) = 10 for all algorithms. Watkins' Q(λ) is unable
to update Q(B,C), Q(C,B), or Q(A,C). Optimistic Q(λ) updates Q(B,C) = 9,
Q(C,B) = 8, and Q(A,C) = 7. TSDT updates Q(B,C) = 9, Q(C,B) = 8, and
Q(A,C) = 9. Note that here Optimistic Q(λ) comes closer to converging
than does Watkins' Q(λ), and that TSDT does even better.

Gracefully Handling State Abstraction 

Watkins' Q(λ) and Optimistic Q(λ) both pass back the δ for the current
Q(s, a). Each entry in the trace gets updated in proportion to
αδe(s, a), e(s, a) being a function of lifetime in the trace. Given that
α, γ, and λ can all be 1, αe(s, a) can be 1 for the lifetime of the
entry in the trace. Thus, if part of a Q-value applies to more than one
state due to state abstraction, an entry in the trace can experience the
δ arbitrarily many times. A small change in expectation for the action
just taken can be magnified many fold. In TSDT, δ is not passed back at
all. Rather, a local δ is calculated for each entry of the trace. If the
δ has changed since the last time the entry was updated, the Q-value is
updated proportionally to the change in δ. In a case that would cause an
update to be magnified in a regular eligibility trace, both Q(s, a) and
V(s'(s, a)) will shift in at least one entry of the trace. This will
cause the second difference to safely approach zero.

Let us discuss what happens when episode {A,C,1} is followed by episode
{A,C,B,C,10} if a value is shared between Q(A,C) and Q(B,C). Let us jump
ahead to BC. This time nothing happens given Q(B,C) = -1 already and
δ = 0. C-10, however, causes Optimistic Q(λ) to double count δ = 10,
first updating Q(B,C) = 9 and then Q(A,C) = 16. Watkins' Q(λ) avoided
this problem by clearing the trace earlier, and similarly the problem
can be avoided here by evicting duplicate Q-values from the trace rather
than relying purely on the usual replacing trace semantics. However,
TSDT does not suffer from this problem at all, updating Q(B,C) = 9 and
then leaving the value unchanged when updating Q(A,C) = 9.

This is not to say that TSDT completely solves all problems resulting
from having duplicate Q-values in a trace. When using α < 1, having
duplicate entries can either decrease or increase the effective learning
rate for the Q-value. Having different rewards or transitions for a
Q-value in a trace can result in an effective decrease in α. Having
duplicate rewards and transitions for a Q-value in a trace too near to
one another can result in an effective increase in α. For this reason,
it remains important to eliminate duplicate entries or bound TSDT sizes
when using α < 1, as with eligibility traces, though the problem is less
severe for TSDT. These problems are entirely absent when using α = 1 for
deterministic processes.

4 Experimental Results 

Experimental results presented are an average of 30 different 
sets of episodes, each starting with a different random seed. 
Further, each plot is smoothed with a running average of 200 
episodes. 

Given that these problems are tractable using value iteration, the
graphs plot the total suboptimality of all actions for a given episode.
In other words, they plot $\sum_t [Q(s_t, a_t) - V(s_t)]$.
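A minimal sketch of this metric, assuming that Q and V here are the
values obtained from value iteration and are available as dictionaries
(the names `q_star`, `v_star`, and the toy numbers below are
illustrative only):

```python
def episode_suboptimality(q_star, v_star, trajectory):
    """Sum of Q(s_t, a_t) - V(s_t) over the (state, action) pairs visited.

    Each term is zero for an optimal action and negative otherwise, so
    the total is zero only for an episode made entirely of optimal moves.
    """
    return sum(q_star[(s, a)] - v_star[s] for s, a in trajectory)

# Toy values: in s0 the action "bad" is worse than the best action by 2.
q_star = {("s0", "good"): 5.0, ("s0", "bad"): 3.0, ("s1", "good"): 6.0}
v_star = {"s0": 5.0, "s1": 6.0}

loss = episode_suboptimality(q_star, v_star, [("s0", "bad"), ("s1", "good")])
perfect = episode_suboptimality(q_star, v_star, [("s0", "good"), ("s1", "good")])
```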

As both versions of the domain examined are episodic, a 
discount rate of 1 is used. 

4.1 Deterministic Cliff-Walking Domain

The cliff-walking domain provides a useful testbed because 
it provides many opportunities for failure, some of which are 
very close to the goal. Additionally, it seems intuitive that an 
agent with a maximally effective memory trace should be able 
to learn a good path from any particular state in one episode, 
should it happen to cover the right ground. 

In the domain depicted in Figure 2 there are 49 non-terminal states,
1 goal state (marked with an 'X'), and 9 failure states. Four actions
are allowed from each non-terminal state, each of which
deterministically moves the agent 1 tile in the specified direction if
possible. Arriving at the goal yields 20 reward, walking off the cliff
yields -20 reward, and all other transitions yield -1 reward.

All four agents tested in this domain use a fixed epsilon-greedy
exploration strategy, with ε = 0.3, preventing any of the agents from
behaving optimally during exploration. Q-learning, an established
off-policy algorithm guaranteed to converge on an optimal policy, nearly
finishes converging after 1500 episodes. The temporal second difference
trace (TSDT) nearly finishes converging in only 1000 episodes, and
without the early dip in performance experienced by Watkins' Q(1). Both
eligibility trace methods, however, result in divergent behavior for
hundreds of thousands of episodes.

Figure 2: The cliff-walking domain explored here, with and without noise
affecting move actions.

Figure 3: This plot depicts the online performance of agents following a
fixed epsilon-greedy exploration strategy in a cliff-walking domain.

While this plot depicts a running average (over 200 episodes) of 30
different sets of 2000 episodes for each temporal difference method,
more detailed statistics are of some interest. Across all 30 policies
developed for each of the 49 different initial conditions, Q-learning
and TSDT have optimal policies for all instances. Watkins' Q(1) and
Optimistic Q(1), however, have optimal policies for only 360 and 289,
respectively, of the 1470 instances.

Results not reproduced here indicate that λ ≈ 0.2 allows the eligibility
trace methods to stably converge on optimal policies, although not much
faster than Q-learning.

4.2 Noisy Cliff-Walking Domain

We now introduce a version of the domain in which actions may result in
different state transitions (and the corresponding rewards) with some
probability. For this experiment, a move behaves normally with
probability 0.8; otherwise its direction is rotated 90° clockwise or
counter-clockwise, each with probability 0.1.
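The noise model can be sketched as follows. The direction names and the
function name are illustrative assumptions. Only the probabilities 0.8,
0.1, and 0.1 come from the description above.

```python
import random

# Listed in clockwise order so that +1 and -1 are quarter-turn rotations.
DIRECTIONS = ["north", "east", "south", "west"]

def noisy_direction(intended, rng):
    """Return the direction actually travelled for an intended move."""
    i = DIRECTIONS.index(intended)
    u = rng.random()
    if u < 0.8:
        return intended                          # behaves normally
    if u < 0.9:
        return DIRECTIONS[(i + 1) % 4]           # rotated 90° clockwise
    return DIRECTIONS[(i - 1) % 4]               # rotated 90° counter-clockwise

rng = random.Random(0)  # seeded only to make the example repeatable
actual = noisy_direction("north", rng)
```

Passing the random generator in explicitly keeps each of the 30 seeded
runs reproducible.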

Strictly speaking, carefully decreasing the learning rate is required
for convergence in a stochastic MDP. However, we will use a fixed
α = 0.05 and the same exploration strategy as in the previous
experiment, acknowledging that unlearning is possible.

Figure 4: This plot depicts the online performance of agents following a
decreasing epsilon-greedy exploration strategy in a noisy cliff-walking
domain.

Here Optimistic Q(1) and Watkins' Q(1) initially do very well,
compensating for the low learning rates. However, TSDT and Q-learning
eventually overtake both eligibility traces. At the end of 5,000
episodes, Q-learning has the best policy, Optimistic Q(1) has finally
caught up with TSDT again, and Watkins' Q(1) has the least optimal
policy. Due to the decreased learning rate, all methods are more even in
the stochastic domain, but both Q-learning and TSDT approach equilibrium
more directly than the eligibility trace methods.

5 Discussion 

Temporal second difference traces (TSDT) have been demonstrated to be
immune to some of the flaws of Watkins' Q(λ).
TSDT has no need to clear traces to ensure the validity of 
off-policy learning when apparently suboptimal actions are 
taken. However, care must still be taken to bound the number 
of duplicate Q-values when learning stochastic processes. 

Optimistic Q(λ) is a replacing eligibility trace with some of the
advantages of TSDT. Traces do not need to be cleared after apparently
suboptimal actions, and increasing expectations of reward can overcome
the penalty for this exploration. However, decreasing expectations of
reward cannot generally be passed back through the trace beyond
apparently suboptimal actions as in TSDT. Additionally, the advantages
of TSDT with respect to state abstraction are lost. Despite these
limitations, Optimistic Q(λ) seems to be effective when learning
stochastic processes.

TSDT has been demonstrated to converge on the optimal policy for the
deterministic cliff-walking domain significantly faster than Q-learning,
as opposed to Q(1), which exhibits significant divergent behavior. TSDT
has been shown to be more comparable to Q-learning than Q(1) in a noisy
cliff-walking domain as well, though the Q(1) methods do better early
on. Additionally, Optimistic Q(1) outperformed Watkins' Q(1) slightly in
both cliff-walking domains.

We expect TSDT to perform at least as well as Q(λ) for deterministic
domains. However, it may not make as much use of information as Q(λ)
when learning rates are low and the traces are long. The amount of
return used by TSDT decreases exponentially with respect to the learning
rate, as opposed to Q(λ), which decreases exponentially only if λ < 1.
Therefore it is important for the efficiency of TSDT to use higher
learning rates whenever possible. It may be that decreasing the learning
rate per Q-value with respect to 1/n could be sufficient to
significantly improve the efficiency of TSDT in stochastic domains,
though alternatives which may keep α higher for longer could do better
still.

We believe that using TSDT instead of Q-learning and Optimistic Q(λ)
instead of Watkins' Q(λ) should be reasonable regardless of the domain.
The only downside is increased computational cost.

We have done additional research in the area of hierarchical
reinforcement learning, looking at the taxicab domain [Dietterich, 1998]
and the fickle taxicab domain [Dietterich, 2000]. Our research has
continued to focus on off-policy learning. We expect to present this
additional work using TSDT in a future publication.